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"Fault attacks have recently become a serious concern in the smart card industry. 
"Fault Tolerant Systems" provides the reader with a clear exposition of these at- 
tacks and the protection strategies that can be used to thwart them. A must read 
for practitioners and researchers working in the field." 

David Naccache, Ecole normale superieure 

"Understanding the fundamentals of an area, whether it is golf or fault tolerance, 
is a prerequisite to developing expertise in the area. Krishna and Koren's book can 
provide a reader with this underlying foundation for fault tolerance. This book 
is particularly timely because the design of fault-tolerant computing components, 
such as processors and disks, is becoming increasingly important to the main- 
stream computing industry." 

Shubu Mukherjee, Director, FACT- AMI Group, Intel Corporation 

"Professors Koren and Krishna, have written a modern, dual purpose text that 
first presents the basics fault tolerance tools describing various redundancy types 
both at the hardware and software levels followed by current research topics. It 
reviews fundamental reliability modeling approaches, combinatorial blocks and 
Markov chain techniques. Notably, there is a complete chapter on statistical sim- 
ulation methods that offers guidance to practical evaluations as well as one on 
fault-tolerant networks. All chapters, which are clearly written including illumi- 
nating examples, have extensive reference lists whereby students can delve deeper 
into almost any topic. Several practical and commercial computing systems that 
incorporate fault tolerance are detailed. Furthermore, there are two chapters in- 
troducing current fault tolerance research challenges, cryptographic systems and 
defects in VLSI designs." 

Robert Redinbo, UC Davis 

"The field of Fault-Tolerant Computing has advanced considerably in the past ten 
years and yet no effort has been made to put together these advances in the form 
of a book or a comprehensive paper for the students starting in this area. This is 
the first book I know of in the past 10 years that deals with hardware and software 
aspects of fault tolerant computing, is very comprehensive, and is written as a text 
for the course." 

Kewal Saluja, University of Wisconsin, Madison 
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Foreword 



Systems used in critical applications such as health, commerce, transportation, 
utilities, and national security must be highly reliable. Ubiquitous use of com- 
puting systems and other electronic systems in these critical areas requires that 
computing systems have high reliability High reliability is achieved by designing 
the systems to be fault-tolerant. Even though the high reliability requirements of 
computing systems gave the original impetus to the study of the design of fault- 
tolerant systems, trends in manufacturing of VLSI circuits and systems are also 
requiring the use of fault-tolerant design methods to achieve high yields from 
manufacturing plants. This is due to the fact that with reduced feature sizes of 
VLSI circuit designs and shortcomings of lithographic techniques used in fabrica- 
tion the characteristics of the manufactured devices are becoming unpredictable. 
Additionally small sizes of devices make them susceptible to radiation induced 
failures causing run time errors. Thus it may be necessary to use fault tolerance 
techniques even in systems that are used in non-critical applications such as con- 
sumer electronics. 

This book covers comprehensively the design of fault-tolerant hardware and 
software, use of fault-tolerance techniques to improve manufacturing yields and 
design and analysis of networks. Additionally it includes material on methods to 
protect against threats to encryption subsystems used for security purposes. The 
material in the book will help immensely students and practitioners in electrical 
and computer engineering and computer science in learning how to design reli- 
able computing systems and how to analyze fault-tolerant computing systems. 

Sudlmkar M. Reddy 

Distinguished Professor of Electrical and Computer Engineering 

University oflozva Foundation 
Iowa City , Iozva 




Preface 



The purpose of this book is to provide a solid introduction to the rich field of fault- 
tolerant computing. Its intended use is as a text for senior-level undergraduate and 
first-year graduate students, as well as a reference for practicing engineers in the 
industry. Since it would be impossible to cover in one book all the fault-tolerance 
techniques and practices that have been developed or are currently in use, we 
have focused on providing the basics of the field and enough background to allow 
the reader to access more easily the rapidly expanding fault-tolerance literature. 
Readers who are interested in further details should consult the list of references 
at the end of each chapter. To understand this book well, the reader should have 
a basic knowledge of hardware design and organization, principles of software 
development, and probability theory. 

The book has 10 chapters; each chapter has a list of relevant references and a 
set of exercises. Solutions to the exercises are available on-line and access to them 
is provided by the publisher upon request to instructors who adopt this book as a 
textbook for their class. Powerpoint slides for instructors are also available. 

The book starts with an outline of preliminaries, in which we provide introduc- 
tory information. This is followed by a set of six chapters that form the core of 
what we believe should be covered in any introduction to fault-tolerant systems. 

Chapter 2 deals with hardware fault-tolerance; this is the discipline with the 
longest history (indeed, the idea of using hardware redundancy for fault-tolerance 
goes back to the very pioneers of computing, most notably von Neumann). We also 
include in this chapter an introduction to some of the probabilistic tools used in 
analyzing reliability measures. 

Chapter 3 deals with information redundancy with the main focus on error 
detecting and correcting codes. Such codes, like hardware fault-tolerance, go back 
a very long way, and were motivated in large measure by the need to counter 
errors in information transmission. The same, or similar, techniques are being used 
today in other applications as well, principally in contemporary memory circuits. 
We have sought to provide a survey of only the more important coding techniques. 
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and it was not intended to be comprehensive: indeed, a comprehensive survey 
of coding would require multiple volumes. Following this, we turn to the issue 
of managing information redundancy in storage, and end with algorithm-based 
fault-tolerance. 

Chapter 4 covers fault-tolerant networks. With processors becoming cheaper, 
distributed systems are becoming more commonplace; we look at some key net- 
work topologies and consider how to quantify and enhance their fault-tolerance. 

Chapter 5 describes techniques for software fault-tolerance. It is widely be- 
lieved that software accounts for a majority of the failures seen in today's com- 
puter systems. As a field, software fault-tolerance is less mature than fault- 
tolerance using either hardware or information redundancy. It is also a much 
harder nut to crack. Software is probably the most complex artificial construct 
that people have created, and rendering it fault-tolerant is an arduous task. We 
cover such techniques as recovery blocks and N-version programming, together 
with a discussion of acceptance tests and ways to model software failure processes 
analytically. 

In Chapter 6, we cover the use of time-redundancy through checkpointing. The 
majority of hardware failures are transient in nature; in other words, they are fail- 
ures which simply go away after some time. An obvious response to such failures 
is to roll back the execution and re-execute the program. Checkpointing is a tech- 
nique which allows us to limit the extent of such re-executions. 

Chapter 7, which contains several case studies, rounds off the core of the book. 
There, we describe several real-life examples of fault-tolerant systems illustrating 
the usage of the various techniques presented in the previous chapters. 

The remaining three chapters of the book deal with more specialized topics. In 
Chapter 8, we cover defect-tolerance in VLSI. As die sizes increase and feature 
sizes drop, it is becoming increasingly important to be able to tolerate manufac- 
turing defects in a VLSI chip without affecting its functionality. We discuss the key 
approaches being used, as well as the underlying mathematical models. 

In Chapter 9, we focus on cryptographic devices. The increasing use of com- 
puters in commerce, including smart cards and Internet shopping, has motivated 
the use of encryption in everyday applications. It turns out that injecting faults 
into cryptographic devices and observing the outputs is an effective way to attack 
secure systems and obtain their secret key. We present in this chapter the use of 
fault-detection to counter these types of security attacks. 

Chapter 10, which ends the book, deals with simulation and experimental tech- 
niques. Simulating a fault-tolerant system to measure its reliability is often com- 
putationally very demanding. We provide in this chapter an outline of basic sim- 
ulation techniques, as well as ways in which simulation can be accelerated. Also 
provided are basic statistical tools by which simulation output can be analyzed 
and an outline of experimental fault-injection techniques. 

A companion web site (www.ecs.umass.edu/ece/koren/FaultTolerantSyst- 
ems/) includes additional resources for the book such as lecture slides, the in- 
evitable list of errors, and, more importantly, a link to an extensive collection of 
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educational tools and simulators that can be of great assistance to the readers of 
the book. Elsevier also maintains an instructor web site that will house the solu- 
tions for those who adopt this book as a textbook for their class. The website can 
be found at http://textbooks.elsevier.com. 
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Preliminaries 



The past 50 years have seen computers move from being expensive computa- 
tional engines used by government and big corporations to becoming an every- 
day commodity, deeply embedded in practically every aspect of our lives. Not 
only are computers visible everywhere, in desktops, laptops, and PDAs, it is also 
a commonplace that they are invisible everywhere, as vital components of cars, 
home appliances, medical equipment, aircraft, industrial plants, and power gen- 
eration and distribution systems. Computer systems underpin most of the world's 
financial systems: given current transaction volumes, trading in the stock, bond, 
and currency markets would be unthinkable without them. Our increasing will- 
ingness, as a society, to place computers in life-critical and wealth-critical applica- 
tions is largely driven by the increasing possibilities that computers offer. And yet, 
as we depend more and more on computers to carry out all of these vital actions, 
we are — implicitly or explicitly — gambling our lives and property on computers 
doing their jobs properly. 

Computers (hardware plus software) are quite likely the most complex systems 
ever created by human beings. The complexity of computer hardware is still in- 
creasing as designers attempt to exploit the higher transistor density that new 
generations of technology make available to them. Computer software is far more 
complex still, and with that complexity comes an increased propensity to failure. It 
is probably fair to say that there is not a single large piece of software or hardware 
today that is free of bugs. Even the space shuttle, with software that was devel- 
oped and tested using some of the best and most advanced techniques known to 
engineering, is now known to have flown with bugs that had the potential to cause 
catastrophe. 

Computer scientists and engineers have responded to the challenge of design- 
ing complex systems with a variety of tools and techniques to reduce the number 
of faults in the systems they build. However, that is not enough: we need to build 
systems that will acknowledge the existence of faults as a fact of life, and incorpo- 
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CHAPTER 1 Preliminaries 



rate techniques to tolerate these faults while still delivering an acceptable level of 
service. The resulting field of fault tolerance is the subject of this book. 



1.1 Fault Classification 

In everyday language, the terms fault, failure, and error are used interchangeably. 
In fault-tolerant computing parlance, however, they have distinctive meanings. 
A fault (or failure) can be either a hardware defect or a software/programming 
mistake (bug). In contrast, an error is a manifestation of the fault/ failure/bug. 

As an example, consider an adder circuit, with an output line stuck at 1; it al- 
ways carries the value 1 independently of the values of the input operands. This 
is a fault, but not (yet) an error. This fault causes an error when the adder is used 
and the result on that line is supposed to have been a 0, rather than a 1. A similar 
distinction exists between programming mistakes and execution errors. Consider, 
for example, a subroutine that is supposed to compute sin(x) but owing to a pro- 
gramming mistake calculates the absolute value of sin(x) instead. This mistake 
will result in an execution error only if that particular subroutine is used and the 
correct result is negative. 

Both faults and errors can spread through the system. For example, if a chip 
shorts out power to ground, it may cause nearby chips to fail as well. Errors can 
spread because the output of one unit is used as input by other units. To return 
to our previous examples, the erroneous results of either the faulty adder or the 
sin(x) subroutine can be fed into further calculations, thus propagating the error. 

To limit such contagion, designers incorporate containment zones into systems. 
These are barriers that reduce the chance that a fault or error in one zone will 
propagate to another. For example, a fault-containment zone can be created by en- 
suring that the maximum possible voltage swings in one zone are insulated from 
the other zones, and by providing an independent power supply to each zone. In 
other words, the designer tries to electrically isolate one zone from another. An 
error-containment zone can be created, as we will see in some detail later on, by 
using redundant units /programs and voting on their output. 

Flardware faults can be classified according to several aspects. Regarding their 
duration, hardware faults can be classified into permanent, transient, or intermittent. 
A permanent fault is just that: it reflects the permanent going out of commission of 
a component. As an example of a permanent fault think of a burned-out lightbulb. 
A transient fault is one that causes a component to malfunction for some time; it 
goes away after that time and the functionality of the component is fully restored. 
As an example, think of a random noise interference during a telephone conversa- 
tion. Another example is a memory cell with contents that are changed spuriously 
due to some electromagnetic interference. The cell itself is undamaged: it is just 
that its contents are wrong for the time being, and overwriting the memory cell 
will make the fault go away. An intermittent fault never quite goes away entirely; 
it oscillates between being quiescent and active. When the fault is quiescent, the 
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component functions normally; when the fault is active, the component malfunc- 
tions. An example for an intermittent fault is a loose electrical connection. 

Another classification of hardware faults is into benign and malicious faults. 
A fault that just causes a unit to go dead is called benign. Such faults are the eas- 
iest to deal with. Far more insidious are the faults that cause a unit to produce 
reasonable-looking, but incorrect, output, or that make a component "act mali- 
ciously" and send differently valued outputs to different receivers. Think of an 
altitude sensor in an airplane that reports a 1000-foot altitude to one unit and a 
8000-foot altitude to another unit. These are called malicious (or Byzantine) faults. 



Types of Redundancy 

All of fault tolerance is an exercise in exploiting and managing redundancy . Redun- 
dancy is the property of having more of a resource than is minimally necessary to 
do the job at hand. As failures happen, redundancy is exploited to mask or other- 
wise work around these failures, thus maintaining the desired level of functional- 

ity- 

There are four forms of redundancy that we will study: hardware, software, 
information, and time. Hardware faults are usually dealt with by using hardware, 
information, or time redundancy, whereas software faults are protected against by 
software redundancy. 

Hardware redundancy is provided by incorporating extra hardware into the 
design to either detect or override the effects of a failed component. For example, 
instead of having a single processor, we can use two or three processors, each per- 
forming the same function. By having two processors, we can detect the failure 
of a single processor; by having three, we can use the majority output to override 
the wrong output of a single faulty processor. This is an example of static hardware 
redundancy, the main objective of which is the immediate masking of a failure. A 
different form of hardware redundancy is dynamic redundancy, where spare com- 
ponents are activated upon the failure of a currently active component. A combi- 
nation of static and dynamic redundancy techniques is also possible, leading to 
hybrid hardware redundancy. 

Hardware redundancy can thus range from a simple duplication to complicated 
structures that switch in spare units when active ones become faulty. These forms 
of hardware redundancy incur high overheads, and their use is therefore normally 
reserved for critical systems where such overheads can be justified. In particu- 
lar, substantial amounts of redundancy are required to protect against malicious 
faults. 

The best-known form of information redundancy is error detection and correc- 
tion coding. Here, extra bits (called check bits) are added to the original data bits 
so that an error in the data bits can be detected or even corrected. The resulting 
error-detecting and error-correcting codes are widely used today in memory units 
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and various storage devices to protect against benign failures. Note that these er- 
ror codes (like any other form of information redundancy) require extra hardware 
to process the redundant data (the check bits). 

Error-detecting and error-correcting codes are also used to protect data commu- 
nicated over noisy channels, which are channels that are subject to many transient 
failures. These channels can be either the communication links among widely sep- 
arated processors (e.g., the Internet) or among locally connected processors that 
form a local network. If the code used for data communication is capable of only 
detecting the faults that have occurred (but not correcting them), we can retrans- 
mit as necessary, thus employing time redundancy. 

In addition to transient data communication failures due to noise, local and 
wide-area networks may experience permanent link failures. These failures may 
disconnect one or more existing communication paths, resulting in a longer com- 
munication delay between certain nodes in the network, a lower data bandwidth 
between certain node pairs, or even a complete disconnection of certain nodes 
from the rest of the network. Redundant communication links (i.e., hardware re- 
dundancy) can alleviate most of these problems. 

Computing nodes can also exploit time redundancy through re-execution of 
the same program on the same hardware. As before, time redundancy is effec- 
tive mainly against transient faults. Because the majority of hardware faults are 
transient, it is unlikely that the separate executions will experience the same fault. 
Time redundancy can thus be used to detect transient faults in situations in which 
such faults may otherwise go undetected. Time redundancy can also be used when 
other means for detecting errors are in place and the system is capable of recov- 
ering from the effects of the fault and repeating the computation. Compared with 
the other forms of redundancy, time redundancy has much lower hardware and 
software overhead but incurs a high performance penalty. 

Software redundancy is used mainly against software failures. It is a reasonable 
guess that every large piece of software that has ever been produced has contained 
faults (bugs). Dealing with such faults can be expensive: one way is to indepen- 
dently produce two or more versions of that software (preferably by disjoint teams 
of programmers) in the hope that the different versions will not fail on the same 
input. The secondary version(s) can be based on simpler and less accurate algo- 
rithms (and, consequently, less likely to have faults) to be used only upon the 
failure of the primary software to produce acceptable results. Just as for hard- 
ware redundancy, the multiple versions of the program can be executed either 
concurrently (requiring redundant hardware as well) or sequentially (requiring 
extra time, i.e., time redundancy) upon a failure detection. 



1 .3 Basic Measures of Fault Tolerance 

Because fault tolerance is about making machines more dependable, it is impor- 
tant to have proper measures (yardsticks) by which to gauge such dependability. 
In this section, we will examine some of these yardsticks and their application. 
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A measure is a mathematical abstraction that expresses some relevant facet of 
the performance of its object. By its very nature, a measure only captures some 
subset of the properties of an object. The trick in defining a suitable measure is to 
keep this subset large enough so that behaviors of interest to the user are captured, 
and yet not so large that the measure loses focus. 

1.3.1 Traditional Measures 

We first describe the traditional measures of dependability of a single computer. 
These metrics have been around for a long time and measure very basic attributes 
of the system. Two of these measures are reliability and availability. 

The conventional definition of reliability, denoted by R(f), is the probability (as 
a function of the time t ) that the system has been up continuously in the time 
interval [0, t]. This measure is suitable for applications in which even a momen- 
tary disruption can prove costly. One example is computers that control physical 
processes such as an aircraft, for which failure would result in catastrophe. 

Closely related to reliability are the Mean Time to Failure, denoted by MTTF, 
and the Mean Time Between Failures, MTBF. The first is the average time the system 
operates until a failure occurs, whereas the second is the average time between two 
consecutive failures. The difference between the two is due to the time needed to 
repair the system following the first failure. Denoting the Mean Time to Repair by 
MTTR, we obtain 

MTBF = MTTF + MTTR 

Availability, denoted by A(t), is the average fraction of time over the interval 
[0, t] that the system is up. This measure is appropriate for applications in which 
continuous performance is not vital but where it would be expensive to have the 
system down for a significant amount of time. An airline reservation system needs 
to be highly available, because downtime can put off customers and lose sales; 
however, an occasional (very) short-duration failure can be well tolerated. 

The long-term availability, denoted by A, is defined as 

A = lim A(t ) 

t—>o O 

A can be interpreted as the probability that the system will be up at some ran- 
dom point in time, and is meaningful only in systems that include repair of faulty 
components. The long-term availability can be calculated from MTTF, MTBF, and 
MTTR as follows: 

MTTF MTTF 

~~ MTBF “ MTTF + MTTR 

A related measure. Point Availability, denoted by A p (t), is the probability that 
the system is up at the particular time instant t. 

It is possible for a low-reliability system to have high availability: consider a 
system that fails every hour on the average but comes back up after only a second. 
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Such a system has an MTBF of just 1 hour and, consequently, a low reliability; 
however, its availability is high: A — 3599/3600 = 0.99972. 

These definitions assume, of course, that we have a state in which the system 
can be said to be “up" and another in which it is not. For simple components, this is 
a good assumption. For example, a lightbulb is either good or burned out. A wire 
is either connected or has a break in it. Flowever, for even simple systems, such an 
assumption can be very limiting. For example, consider a processor that has one 
of its several hundreds of millions of gates stuck at logic value 0. In other words, 
the output of this logic gate is always 0, regardless of the input. Suppose the rest 
of the processor is functional, and that this failed logic gate only affects the output 
of the processor about once in every 25,000 hours of use. For example, a particular 
gate in the divide unit when being faulty may result in a wrong quotient if the 
divisor is within a certain subset of values. Clearly, the processor is not fault-free, 
but would one define it as "down"? 

The same remarks apply with even greater force to systems that degrade grace- 
fully. By this, we mean systems with various levels of functionality. Initially, with 
all of its components operational, the system is at its highest level of functionality. 
As these components fail, the system degrades from one level of functionality to 
the next. Beyond a certain point, the system is unable to produce anything of use 
and fails completely. As with the previous example, the system has multiple "up" 
states. Is it said to fail when it degrades from full to partial functionality? Or when 
it fails to produce any useful output at all? Or when its functionality falls below a 
certain threshold? If the last, what is this threshold, and how is it determined? 

We can therefore see that traditional reliability and availability are very limited 
in what they can express. There are obvious extensions to these measures. For 
example, we may consider the average computational capacity of a system with n 
processors. Let c; denote the computational capacity of a system with i operational 
processors. This can be a simple linear function of the number of processors, c,- = 
ici, or a more complex function of i, depending on the ability of the application to 
utilize i processors. The Average Computational Capacity of the system can then be 
defined as Y^i=i c /P/(0/ where P,(f) is the probability that exactly i processors are 
operational at time t. In contrast, the reliability of the system at time t will be 

n 

R(t) = J2 p i(t) 

i=m 

where m is the minimum number of processors necessary for proper operation of 
the system. 

1.3.2 Network Measures 

In addition to the general system measures previously discussed, there are also 
more specialized measures, focusing on the network that connects the processors 
together. The simplest of these are classical node and line connectivity, which are 
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Network N1 Network N2 

FIGURE 1.1 Inadequacy of classical connectivity. 

defined as the minimum number of nodes and lines, respectively, that have to fail 
before the network becomes disconnected. This gives a rough indication of how 
vulnerable a network is to disconnection: for example, a network that can be dis- 
connected by the failure of just one (critically positioned) node is potentially more 
vulnerable than another that requires at least four nodes to fail before it becomes 
disconnected. 

Classical connectivity is a very basic measure of network reliability. Like reli- 
ability, it distinguishes between only two network states: connected and discon- 
nected. It says nothing about how the network degrades as nodes fail before, or 
after, becoming disconnected. Consider the two networks shown in Figure 1.1. 
Both networks have the same classical node connectivity of 1. However, in a real 
sense, network N 1 is much more "connected" than N 2. The probability that N 2 
splinters into small pieces is greater than that for N 1. 

To express this type of "connectivity robustness," we can use additional mea- 
sures. Two such measures are the average node-pair distance, and the network di- 
ameter (the maximum node-pair distance), both calculated given the probability 
of node and / or link failure. Such network measures, together with the traditional 
measures listed above, allow us to gauge the dependability of various networked 
systems that consist of computing nodes connected through a network of commu- 
nication links. 

1 .4 Outline of This Book 

The next chapter is devoted to hardware fault tolerance. This is the most estab- 
lished topic within fault-tolerant computing, and many of the basic principles and 
techniques that have been developed for it have been extended to other forms 
of fault tolerance. Techniques to evaluate the reliability and availability of fault- 
tolerant systems are introduced here, including the use of Markov models. 

Next, several variations of information redundancy are covered, starting with 
the most widely used error detecting and correcting codes. Then, other forms of 
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information redundancy are discussed, including storage redundancy (RAID sys- 
tems), data replication in distributed systems, and, finally, the algorithm-based 
fault-tolerance technique that tolerates data errors in array computations using 
some error-detecting and error-correcting codes. 

Many computing systems nowadays consist of multiple networked proces- 
sors that are subject to interconnection link failures, in addition to the already- 
discussed single node/processor failures. We, therefore, present in this book suit- 
able fault tolerance techniques for these networks and analysis methods to deter- 
mine which network topologies are more robust. 

Software mistakes /bugs are, in practice, unavoidable, and consequently, some 
level of software fault tolerance is a must. This can be as simple as acceptance tests 
to check the reasonableness of the results before using them, or as complex as run- 
ning two or more versions of the software (sequentially or in parallel). Programs 
also tend to have their state deteriorate after running for long periods of time and 
eventually crash. This situation can be avoided by periodically restarting the pro- 
gram, a process called rejuvenation. Unlike hardware faults, software faults are 
very hard to model. Still, a few such models have been developed and several of 
them are described. 

Hardware fault-tolerance techniques can be quite costly to implement. In ap- 
plications in which a complete and immediate masking of the effect of hardware 
faults (especially, of transient nature) is not necessary, checkpointing is an inexpen- 
sive alternative. For programs that run for a long time and for which re-execution 
upon a failure might be too costly, the program state can be saved (once or periodi- 
cally) during the execution. Upon a failure occurrence, the system can roll back the 
program to the most recent checkpoint and resume its execution from that point. 
Various checkpointing techniques are presented and analyzed in the book. 

Case studies illustrating the use of many of the fault-tolerance techniques de- 
scribed previously are presented, including Tandem, Stratus, Cassini, and micro- 
processors from IBM and Intel. 

Two fault-tolerance topics that are rapidly increasing in practical importance, 
namely, defect tolerant VLSI design and fault tolerance in cryptographic devices 
are discussed. The increasing complexity of VLSI chip design has resulted in a 
situation in which manufacturing defects are unavoidable. If nothing is done to 
remedy this situation, the expected yield (the fraction of manufactured chips which 
are operational) will be very low. Thus, techniques to reduce the sensitivity of 
VLSI chips to defects have been developed, some of which are very similar to the 
hardware redundancy schemes. 

For cryptographic devices, the need for fault tolerance is two-fold. Not only 
is it crucial that such devices (e.g., smart cards) operate in a fault-free manner in 
whatever environment they are used, but more importantly, they must stay secure. 
Fault-injection-based attacks on cryptographic devices have become the simplest 
and fastest way to extract the secret key from the device. Thus, the incorporation 
of fault tolerance is a must in order to keep cryptographic devices secure. 
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An important part of the design and evaluation process of a fault-tolerant sys- 
tem is to demonstrate that the system does indeed function at the advertised 
level of reliability Often the designed system is too complex to develop analyt- 
ical expressions of its reliability If a prototype of the system has already been 
constructed, then fault-injection experiments can be performed and certain de- 
pendability attributes measured. If, however, as is very common, a prototype does 
not yet exist, statistical simulation must be used. Simulation programs for com- 
plex systems must be carefully designed to produce accurate results. We discuss 
the principles that should be followed when preparing a simulation program, and 
show how simulation results can be analyzed to infer system reliability. 



1.5 Further Reading 

Several textbooks and reference books on the topic of fault tolerance have been 
published in the past. See, for example, [2,4,5,9,10,13-16]. Journals have published 
several special issues on fault-tolerant computing, e.g., [7,8]. The major conference 
in the field is the Conference on Dependable Systems and Netzvorks (DSN) [3]; this is a 
successor to the Fault-Tolerant Computing Symposium (FTCS). 

The concept of computing being invisible everywhere appeared in [19], in the 
context of pervasive computing, that is, computing that pervades everyday living, 
without being obtrusive. 

The definitions of the basic terms and measures appear in most of the text- 
books mentioned above and in several probability and statistics books. For exam- 
ple, see [18]. Our definitions of fault and error are slightly different from those 
used in some of the references. A generally used definition of an error is that it is 
that part of the system state that leads to system failure. Strictly interpreted, this 
only applies to a system with state, i.e., with memory. We use the more encompass- 
ing definition of anything that can be construed as a manifestation of a fault. This 
wider interpretation allows purely combinational circuits, which are stateless, to 
generate errors. 

One measure of dependability that we did not describe in the text is to con- 
sider everything from the perspective of the application. This approach was taken 
to define the measure known as performability . The application is used to define 
"accomplishment levels" Li,L?, . . . , L n . Each of these represents a level of quality 
of service delivered by the application. For example, L, may be defined as follows: 
"There are i system crashes during the time period [0, T]." Now, the performance 
of the computer affects this quality (if it did not, by definition, it would have noth- 
ing to do with the application!). The approach taken by performability is to link 
the performance of the computer to the accomplishment level that this enables. 
Performability is then a vector, ( P(L\),P(L 2 ), . . . ,P(L„)), where P(L;) is the probabil- 
ity that the computer functions well enough to permit the application to reach up 
to accomplishment level L,-. For more on performability, see [6,11,12]. 
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Hardware Fault 
Tolerance 



Hardware fault tolerance is the most mature area in the general field of fault- 
tolerant computing. Many hardware fault-tolerance techniques have been devel- 
oped and used in practice in critical applications ranging from telephone ex- 
changes to space missions. In the past, the main obstacle to a wide use of hardware 
fault tolerance has been the cost of the extra hardware required. With the contin- 
ued reduction in the cost of hardware, this is no longer a significant drawback, and 
the use of hardware fault-tolerance techniques is expected to increase. However, 
other constraints, notably on power consumption, may continue to restrict the use 
of massive redundancy in many applications. 

This chapter first discusses the rate at which hardware failures occur, as well 
as its effect on the reliability of a single component. We then extend the discus- 
sion to more complex systems consisting of multiple components, describe vari- 
ous resilient structures which have been proposed and implemented, and evalu- 
ate their reliability and / or availability. Next, we describe hardware fault-tolerance 
techniques that have been developed specifically for general-purpose processors. 
Finally, we discuss malicious faults and investigate the amount of redundancy 
needed for tolerating them. 



2.1 The Rate of Hardware Failures 

The single most important parameter used in the reliability analysis of hardware 
systems is the component failure rate, which is the rate at which an individual 
component suffers faults. This is the expected number of failures per unit time 
that a currently good component will suffer in a given future time interval. The 
failure rate depends on the current age of the component, any voltage or physical 
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FIGURE 2.1 Bathtub curve. 



shocks that it suffers, the ambient temperature, and the technology. The depen- 
dence on age is usually captured by what is known as the bathtub curve (see Fig- 
ure 2.1). When components are very young, their failure rate is quite high. This 
is due to the chance that some components with manufacturing defects slipped 
through manufacturing quality control and were released. As time goes on, these 
components are weeded out, and the component spends the bulk of its life show- 
ing a fairly constant failure rate. As it becomes very old, aging effects start to take 
over, and the failure rate rises again. 

The impact of the other factors can be expressed through the following empiri- 
cal failure rate formula: 



X — TtrTXQiCiTcjTxv + Czke) ( 2 . 1 ) 

where the notations are as follows: 

X Failure rate of component. 

jtl Learning factor, associated with how mature the technology is. 

7 Tn Quality factor, representing manufacturing process quality control (rang- 
ing from 0.25 to 20.00). 

ttj Temperature factor, with values ranging from 0.1 to 1000. It is propor- 
tional to e -Ea ^ T , where E a is the activation energy in electron-volts 
associated with the technology, k is the Boltzmann constant (0.8625 x 
10 -4 eV /K), and T is the temperature in Kelvin. 

Try Voltage stress factor for CMOS devices; can range from 1 to 10, depend- 
ing on the supply voltage and the temperature; does not apply to other 
technologies (where it is set to 1). 

tte Environment shock factor; ranges from very low (about 0.4), when the 
component is in an air-conditioned office environment, to very high (13.0) 
when it is in a harsh environment. 
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Ci, C 2 Complexity factors; functions of the number of gates on the chip and the 
number of pins in the package. 

Further details can be found in MIL-HDBK-217E, which is a handbook pro- 
duced by the U.S. Department of Defense. 

Devices operating in space, which is replete with charged particles and can sub- 
ject devices to severe temperature swings, can thus be expected to fail much more 
often than their counterparts in air-conditioned offices, so too can computers in 
automobiles (which suffer high temperatures and vibration) and industrial appli- 
cations. 



2.2 Failure Rate, Reliability, and 
Mean Time to Failure 



In this section, we consider a single component of a more complex system, and 
show how reliability and Mean Time to Failure (MTTF) can be derived from the 
basic notion of failure rate. Consider a component that is operational at time f = 0 
and remains operational until it is hit by a failure. Suppose now that all failures 
are permanent and irreparable. Let T denote the lifetime of the component (the 
time until it fails), and let /(f) and F(t) denote the probability density function of 
T and the cumulative distribution function of T, respectively. These functions are 
defined for f ^ 0 only (because the lifetime cannot be negative) and are related 
through 

/(0=d elf' F(0 = l /(T)dT (2-2) 

/(f) represents (but is not equal to) the momentary probability of failure at time 
f. To be exact, for a very small Af,/(f)Af ~ Prob{f C T C f + At). Being a density 
function, /(f) must satisfy 



/(f) ^ 0 for f > 0 and 




= 1 



F(f) is the probability that the component will fail at or before time f. 



F(f) = Prob{T < f} 



R(f), the reliability of a component (the probability that it will survive at least until 
time f), is given by 

R(t) = ProbfT > f} = 1 — F(f) (2.3) 

/(f) represents the probability that a new component will fail at time f in the future. 
A more meaningful quantity is the probability that a good component of current 
age f will fail in the next instant of length df. This is a conditional probability, since 




14 



CHAPTER 2 Hardware Fault Tolerance 



we know that the component survived at least until time f. This conditional prob- 
ability is represented by the failure rate (also called the hazard rate) of a component 
at time t, denoted by X(t), which can be calculated as follows: 



m = 



m 

i -m 



Since = —/(f), we obtain 



m = - 



1 d R(t) 
R(t) d t 



(2.4) 



(2.5) 



Certain types of components suffer no aging and have a failure rate that is constant 
over time, X(t) = X. In this case. 



d R(t) 
d t 



= —XR(t) 



and the solution of this differential equation (with R( 0) = 1) is 

R(t) = e~ xt 



( 2 . 6 ) 



Therefore, a constant failure rate implies that the lifetime T of the component has 
an exponential distribution, with a parameter that is equal to the constant failure 
rate X 

/(f) = Xe~ xt F(t) = 1 - e~ xt R(t ) = e~ xt for f ^ 0 

For an irreparable component, the MTTF is equal to its expected lifetime, £[T] 
(where E[ ] denotes the expectation or mean of a random variable) 



MTTF = E[T] = 




(2.7) 



Substituting = —/(f) yields 

r°° d R(t) r°° 

MTTF — — f-^df=-fR(f)| “+ j R(f)df = j R(t)dt (2.8) 

(the term — tR(t ) is equal to zero when f = 0 and when f = oo, since R(oc) = 0). 

For the case of a constant failure rate for which R(t ) = e~ xt , 

roc i 

MTTF = / e“ Af df=- 
J 0 A 

Although a constant failure rate is used in most calculations of reliability (mainly 
owing to the simplified derivations), there are cases for which this simplifying 
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assumption is inappropriate, especially during the "infant mortality" and "wear- 
out” phases of a component's life (Figure 2.1). In such cases, the Weibull distrib- 
ution is often used. This distribution has two parameters, X and P, and has the 
following density function of the lifetime T of a component: 

/(f) = Xpt p ~ l e~ ktP (2.9) 

The corresponding failure rate is 



X(t) = Xp t p ~ l (2.10) 

This failure rate is an increasing function of time for p > 1, is constant for P — 1, 
and is a decreasing function of time for p < 1 . This makes it very flexible, and es- 
pecially appropriate for the wear-out and infant mortality phases. The component 
reliability for a Weibull distribution is 

R(t) = e~ xt/! (2.11) 

and the MTTF of the component is 

MTTF = } (2.12) 

px p- 1 

where T(x) = y x ~ 1 e~' 1 dy is the Gamma function. The Gamma function is a gen- 

eralization of the factorial function to real numbers, and satisfies 

■ T(x) = (x — l)T(x — 1) for x > 1; 

■ T(0) = T(l) = 1; 

■ r(n) = (n-l)! for an integer n, n = 1, 2, 

Note that the Weibull distribution includes as a special case (P — 1) the exponential 
distribution with a constant failure rate X. 

With these preliminaries, we now turn to structures that consist of more than 
one component. 



2.3 Canonical and Resilient Structures 

In this section, we consider some canonical structures, out of which more complex 
structures can be constructed. We start with the basic series and parallel struc- 
tures, continue with non-series / parallel ones, and then describe some of the many 
resilient structures that incorporate redundant components (next referred to as 
modules). 
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(a) Series system (b) Parallel system 



FIGURE 2.2 Series and parallel systems. 



2.3.1 Series and Parallel Systems 

The most basic structures are the series and parallel systems depicted in Figure 2.2. 
A series system is defined as a set of N modules connected together so that the fail- 
ure of any one module causes the entire system to fail. Note that the diagram in 
Figure 2.2a is a reliability diagram and not always an electrical one; the output of 
the first module is not necessarily connected to the input of the second module. 
The four modules in this diagram can, for example, represent the instruction de- 
code unit, execution unit, data cache, and instruction cache in a microprocessor. 
All four units must be fault-free for the microprocessor to function, although the 
way they are connected does not resemble a series system. 

Assuming that the modules in Figure 2.2a fail independently of each other, the 
reliability of the entire series system is the product of the reliabilities of its N mod- 
ules. Denoting by R/(f) the reliability of module i and by R s (f) the reliability of the 
whole system, 

N 

R s (t) = f] R,(t) (2.13) 

i= 1 

If module i has a constant failure rate, denoted by /.,, then, according to Equa- 
tion 2.6, Ri(t) = e~ Xit , and consequently. 



Rs(t) = e~ lst (2.14) 

where X s = Yl fL] A- From Equation 2.14 we see that the series system has a con- 
stant failure rate equal to A. s (the sum of the individual failure rates), and its MTTF 
is therefore MTTF S = X . 

A parallel system is defined as a set of N modules connected together so that 
it requires the failure of all the modules for the system to fail. This leads to the 
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FIGURE 2.3 A non-series/parallel system. 

following expression for the reliability of a parallel system, denoted by Rp(t): 

N 

Rp(t) = l-l\(l-Ri(t)) ( 2 . 15 ) 

('= 1 

If module i has a constant failure rate /.„ then 

N 

Rp(t) = 1 — J - [(1 — e~ kit ) ( 2 . 16 ) 

;= l 

As an example, the reliability of a parallel system consisting of two modules with 
constant failure rates and /.2 is given by 

Rp(t) = e~ Xlt + e~ k2t - e ~ {kl+k2)t 

Note that a parallel system does not have a constant failure rate; its failure rate 
decreases with each failure of a module. It can be shown that the MTTF of a parallel 
system with all its modules having the same failure rate X is MTTF^, = Yj?=i ry ■ 

2.3.2 Non-Series/Parallel Systems 

Not all systems have a reliability diagram with a series /parallel structure. Fig- 
ure 2.3 depicts a non-series/ parallel system whose reliability cannot be calculated 
using either Equation 2.13 or 2.15. Each path in Figure 2.3 represents a configu- 
ration that allows the system to operate successfully. For example, the path ADF 
means that the system operates successfully if all three modules A, D and F are 
fault-free. A path in such reliability diagrams is valid only if all modules and 
edges are traversed from left to right. The path BCDF in Figure 2.3 is thus in- 
valid. No graph transformations that may result in violations of this rule are al- 
lowed. 
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(a) C not working 




(b) C working 



FIGURE 2.4 Expanding the diagram in Figure 2.3 about module C. 

In the following analysis, the dependence of the reliability on the time f is omit- 
ted for simplicity of notation, although it is implied that all reliabilities are func- 
tions of t. 

We calculate the reliability of the non-series /parallel system in Figure 2.3 by ex- 
panding about a single module i. That is, we condition on whether or not module 
i is functional, and use the Total Probability formula. 

^system = Ri • ProbjSystem works|/ is fault-free} 

+ (1 — R{) ■ ProbjSystem works \i is faulty} (2.17) 

where, as before, R ; - denotes the reliability of module i (i — A,B,C,D,E,F). We can 
now draw two new diagrams. In the first, module i will be assumed to be working, 
and in the second, module i will be faulty. Module i is selected so that the two new 
diagrams are as close as possible to simple series/ parallel structures for which we 
can then use Equations 2.13 and 2.15. Selecting module C in Figure 2.3 results in 
the two diagrams in Figure 2.4. The process of expanding is then repeated until 
the resulting diagrams are of the series /parallel type. Figure 2.4a is already of 
the series/parallel type, whereas Figure 2.4b needs further expansion about E. 
Note that Figure 2.4b should not be viewed as a parallel connection of A and B, 
connected serially to D and E in parallel; such a diagram will have the path BCDF, 
which is not a valid path in Figure 2.3. Based on Figure 2.4 we can write, using 
Equation 2.17, 



^system = Rc ' ProbjSystem works|C is fault-free} 

+ (1 - Rc)Rf[ 1 - (1 - RaRd)( 1 - RbRe )] 



(2.18) 
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Expanding the diagram in Figure 2.4b about £ yields 
ProbjSystem works|C is fault-free} 

= ReRf[ 1 - (1 - RaX 1 - Rb)] + (1 - Re)R A RdRf 

Substituting this last expression in 2.18 results in 

^system = Rc[ReRf(Ra + Rb - R a Rb) + (1 - Re)R a RdRf] 
+ (1 - Rc)[Rf(R A Rd + RbRe - R a RdRbRe )] 



(2.19) 



If R/ \ — Rb — Rc — Rd = Re = Rf — R> then 

^system — R ^ (R? ~ 3 R 3 + R + 2) (2.20) 

If the diagram of the non-series/ parallel structure is too complicated to apply the 
above procedure, upper and lower bounds on (^system can be calculated instead. 
An upper bound is given by 

^system ^ 1 — j j (1 — kpath i) (2.21) 

where R pa th i is the reliability of the series connection of the modules along path i. 
The bound in Equation 2.21 assumes that all the paths are in parallel and that they 
are independent. In reality, two of these paths may have a module in common, 
and the failure of this module will result in both paths becoming faulty. That is 
why Equation 2.21 provides only an upper bound rather than an exact value. As 
an example, let us calculate the upper bound for Figure 2.3. The paths are ADF , 
BEF, and ACEF, resulting in 

^system < 1 - (1 - R A RdRf)( 1 - R B ReRf)(^ ~ R A RcReRf) (2.22) 

If R a = R b = R c = R d = R e = R f = R, then ^system R 3 (R 7 - 2 £ 4 - R 3 + R + 2), 
which is less accurate than the exact calculation in Equation 2.20. 

The upper bound can be used to derive the exact reliability, by performing the 
multiplication in Equation 2.22 (or Equation 2.21 in the general case) and replacing 
every occurrence of R * by Rj. Since each module is used only once, its reliability 
should not be raised to any power greater than 1. The reader is invited to verify 
that applying this rule to the upper bound in Equation 2.22 yields the same exact 
reliability as in Equation 2.19. 

A lower bound can be calculated based on minimal cut sets of the system dia- 
gram, where a minimal cut set is a minimal list of modules such that the removal 
(due to faults) of all modules from the set will cause a working system to fail. The 
lower bound is obtained by 



R 



system 






na-Qcut,) 



(2.23) 
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FIGURE 2.5 Comparing the exact reliability of the non-series/parallel system in Fig- 
ure 2.3 to its upper and lower bounds. 



where Q CLlt , is the probability that minimal cut i is faulty. In Figure 2.3, the minimal 
cut sets are F, AB, AE, DE, and BCD. Consequently, 

^system ^ Rf[ 1 - (1 - Ra)( 1 - Rb)] [1 - (1 - RaX 1 - Re)] [1 - (1 - Rd)(1 - Re)] 
x[1-(1-R b )(1-R c )(1-Rd)] (2.24) 

If Ra — Rb — Rc — Rd — Re = Rf — R, then -R S ystem S* -R 5 (24 — 60 R + 62 R~ - 33 R 3 + 
9R 4 — R 5 ). Figure 2.5 compares the upper and lower bounds to the exact system 
reliability for the case in which all six modules have the same reliability R. Note 
that in this case, for the more likely high values of R, the lower bound provides a 
very good estimate for the system reliability 

2.3.3 M-of-N Systems 

An M-of-N system is a system that consists of N modules and needs at least M of 
them for proper operation. Thus, the system fails when fewer than M modules are 
functional. The best-known example of this type of systems is the triplex, which 
consists of three identical modules whose outputs are voted on. This is a 2-of-3 
system: so long as a majority (2 or 3) of the modules produce correct results, the 
system will be functional. 

Let us now compute the reliability of an M-of-N system. We assume as before 
that the failures of the different modules are statistically independent and that 
there is no repair of failing modules. If R(f) is the reliability of an individual mod- 
ule (the probability that the module is still operational at time f), the reliability 
of an M-of-N system is the probability that M or more modules are functional at 
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FIGURE 2.6 A Triple Modular Redundant (TMR) structure. 

time t. The system reliability is therefore given by 

N , . 

R M _oi_N(t) = E ( 7 ) R! ’( f )[! - m] N ~' (2.25) 

(=M ' ' 

where (7) = Th e assumption that failures are independent is key to the 

high reliability of M-of-N systems. Even a slight extent of positively correlated 
failures can greatly diminish their reliability. For example, suppose q COI is the prob- 
ability that the entire system suffers a common failure. The reliability of the system 
now becomes 



N , . 

R7_ of_N« = (1 - <7cor) E(f) R, « [1 - *(*)f (2.26) 

i=M R ' 

If the system is not designed carefully, the correlated failure factor can dominate 
the overall failure probability 

In practice, correlated failure rates can be extremely difficult to estimate. In 
Equation 2.26, we assumed that there was a failure mode in which the entire clus- 
ter of N modules suffers a common failure. However, there are other modes as 
well, in which subsets of the N modules could undergo a correlated failure. There 
being 2 N — N — 1 subsets containing two or more modules, it quickly becomes in- 
feasible to obtain by experiment or otherwise the correlated failure probabilities 
associated with each of the subsets, even for moderate values of N. 

Perhaps the most important M-of-N system is the triplex, or the Triple Modular 
Redundant (TMR) cluster shown in Figure 2.6. In such a system, M — 2 and N — 3, 
and a voter selects the majority output. If a single voter is used, that voter becomes 
a critical point of failure and the reliability of the cluster is 

RTMR(0 = Rvoter(f) E ( / ) *'’(*) t 1 “ R (0] 3 “' 
i = 2 ^ ' 

= Rvoter(f)(3R 2 (f)[l - R(0] +R 3 (t)) = R vot er(f)(3R 2 (f) - 2 R 3 (f)) (2.27) 

where R vo ter (f) is the reliability of the voter. 
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FIGURE 2.7 Comparing NMR reliability (for N = 3 and 5) to that of a single module 
(voter failure rate is considered negligible). 



The general case of TMR is called N-modular redundancy (NMR) and is an 
M-of-N cluster with N odd and M— [N/2] . 

In Figure 2.7, we plot the reliability of a simplex (a single module), a triplex 
(TMR), and an NMR cluster with N — 5. For high values of R(f), the greater the 
redundancy, the higher the system reliability As R(t) decreases, the advantages 
of redundancy become less marked; until for R(t) < 0.5, redundancy actually be- 
comes a disadvantage, with the simplex being more reliable than either of the re- 
dundant arrangements. This is also reflected in the value of MTTFtmR/ which (for 
^voter(f) = 1 and R(t) = e -Af ) can be calculated based on Equation 2.8 as 

roo roo c 

MTTFtmr = / (3 R 2 (t) - 2 R 3 (f)) df = / (3e“ 2At - 2e“ 3Ai ) dt = — 

Jo Jo oX 

< ~ = MTTFsi m pl ex 

In most applications, however, R(t) X>> 0.5 for realistic t and the system is repaired 
or replaced long before R(t) < 0.5, so a triplex arrangement does offer significant 
reliability gains. 

Equation 2.27 was derived under the conservative assumption that every fail- 
ure of the voter will lead to erroneous system output and that any failure of two 
modules is fatal. This is not necessarily the case. If, for example, one module has 
a permanent logical 1 on one of its outputs and a second module has a perma- 
nent logical 0 on its corresponding output, the TMR (or NMR) will still function 
properly. Clearly, a similar situation may arise regarding certain faults within the 
voter circuit. These are examples of compensating faults. Another case of faults 
that may be harmless are non-overlapping faults. For example, one module may 
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have a faulty adder and another module a faulty multiplier. If the adder and mul- 
tiplier circuits are disjoint, the two faulty modules are unlikely to generate wrong 
outputs simultaneously If all compensating and non-overlapping faults are taken 
into account, the resulting reliability will be higher than that predicted by Equa- 
tion 2.27. 

2.3.4 Voters 

A voter receives inputs X\,X 2 , ■ ■ ■ ,*n from an M-of-N cluster and generates a repre- 
sentative output. The simplest voter is one that does a bit-by-bit comparison of the 
outputs, and checks if a majority of the N inputs are identical. If so, it outputs the 
majority. This approach only works when we can guarantee that every functional 
module will generate an output that matches the output of every other functional 
module, bit by bit. This will be the case if the modules are identical processors, use 
identical inputs and identical software, and have mutually synchronized clocks. 

If, however, the modules are different processors or are running different soft- 
ware for the same problem, it is possible for two correct outputs to diverge slightly, 
in the lower significant bits. Hence, we can declare two outputs x and y as practi- 
cally identical if \x — y\ < S for some specified S. (Note that "practically identical" 
is not transitive; if A is practically identical to B and B is practically identical to C, 
this does not necessarily mean that A is practically identical to C.) 

For such approximate agreement, we can do plurality voting. A k-plurality voter 
looks for a set of at least k practically identical outputs (this is a set in which each 
member is practically identical to all other members) and picks any of them (or the 
median) as the representative. For example, if we set S — 0.1 and the five outputs 
were 1.10,1.11,1.32,1.49,3.00, then the subset {1.10,1.11} would be selected by a 
2-plurality voter. 

In our discussion so far, we have implicitly assumed that each output has an 
equal chance of being faulty. In some cases that may not be true; the hardware (or 
software) producing one output may have a different failure probability than does 
the hardware (or software) producing another output. In this case, each output is 
assigned a weight that is related to its probability of being correct. The voter then 
does weighted voting and produces an output that is associated with over half the 
sum of all weights. 

2.3.5 Variations on JV-Modular Redundancy 

Unit-Level Modular Redundancy 

In addition to applying replication and voting at the level of the entire system, the 
same idea can be applied at the subsystem level as well. Figure 2.8 shows how 
triple-modular replication can be applied at the individual unit level for a system 
consisting of four units. In such a scheme, the voters are no longer as critical as in 
NMR. A single faulty voter will cause no more harm than a single faulty unit, and 
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FIGURE 2.9 Triplicated voters in a processor/memory TMR. 

the effect of either one will not propagate beyond the next level of units. Clearly, 
the level at which replication and voting are applied can be further lowered at the 
expense of additional voters, increasing the overall size and delay of the system. 

Of particular interest is the triplicated processor /memory system shown in 
Figure 2.9. Here, all communications (in either direction) between the tripli- 
cated processors and triplicated memories go through majority voting. This or- 
ganization is more reliable than a single majority voting of a triplicated proces- 
sor/memory structure. 

Dynamic Redundancy 

The above variations of NMR employ considerable amounts of hardware in order 
to instantaneously mask errors that may occur during the operation of the sys- 
tem. However, in many applications, temporary erroneous results are acceptable 
as long as the system is capable of detecting such errors and reconfiguring itself by 
replacing the faulty module with a fault-free spare module. An example of such 
a dynamic (or active) redundancy scheme is depicted in Figure 2.10, in which the 
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FIGURE 2.10 Dynamic redundancy. 

system consists of one active module, N spare modules, and a Fault Detection and 
Reconfiguration unit that is assumed to be capable of detecting any erroneous out- 
put produced by the active module, disconnecting the faulty active module, and 
connecting instead a fault-free spare (if one exists). 

Note that if all the spare modules are active (powered), we expect them to have 
the same failure rate as the single active module. This dynamic redundancy struc- 
ture is, therefore, similar to the basic parallel system in Figure 2.2, and its reliability 
is given by 

^dynamic (f) = R d ru(0(l ~ [l ~ R(t)] N+1 ) (2-28) 

where R(t) is the reliability of each module, and R ( j m (f) is the reliability of the 
detection and reconfiguration unit. If, however, the spare modules are not pow- 
ered (in order to conserve energy), they may have a negligible failure rate when 
not in operation. Denoting by c the coverage factor, defined as the probability that 
the faulty active module will be correctly diagnosed and disconnected and a good 
spare will be successfully connected, we can derive the system reliability for very 
large N by arguing as follows. 

Failures to the active module occur at rate X. The probability that a given such 
failure cannot be recovered from is 1 — c. Flence, the rate at which unrecoverable 
failures occur is (1 — c)X. The probability that no unrecoverable failure occurs to 
the active processor over a duration t is therefore given by the reliability 

of the reconfiguration unit is given by Rdru(0- We therefore have: 

^dynamic(t) = Rdru(f)e _(1_C)W (2.29) 



Hybrid Redundancy 

An NMR system is capable of masking permanent and intermittent failures, but 
as we have seen, its reliability drops below that of a single module for very long 
mission times if no repair or replacement are taking place. The objective of hy- 
brid redundancy is to overcome this by adding spare modules that will be used 
to replace active modules once they become faulty. Figure 2.11 depicts a hybrid 
system consisting of a core of N processors constituting an NMR, and a set of K 
spares. The outputs of the active primary modules are compared (by the Compare 
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FIGURE 2.11 Hybrid redundancy. 

unit) to the output of the voter to identify a faulty primary (if any). The Compare 
unit then generates the corresponding disagreement signal, which will cause the 
Reconfiguration unit to disconnect the faulty primary and connect a spare module 
instead. 

The reliability of a hybrid system with a TMR core and K spares is 

R hybrid (0 = Rvoter(f)Rrec(f)(l - mR(t)[ 1 - R^f 1 - [l - R(t)f) (2.30) 

where m — K + 3 is the total number of modules, and R v o ter(0 and R rec (f) are the 
reliability of the voter and the comparison and reconfiguration circuitry, respec- 
tively. Equation 2.30 assumes that any fault in either the voter or the comparison 
and reconfiguration circuit will cause a system failure. In practice, not all faults 
in these circuits are fatal, and the reliability of the hybrid system will be higher 
than what is predicted by Equation 2.30. A more accurate value of R hybrid (0 can 
be obtained through a detailed analysis of the voter and the comparison and re- 
configuration circuits and the different ways in which they can fail. 

Sift-Out Modular Redundancy 

As in NMR, all N modules in the Sift-out Modular Redundancy scheme (see Fig- 
ure 2.12) are active, and the system is operational as long as there are at least two 
fault-free modules. Unlike NMR, this system uses comparator, detector, and col- 
lector circuits instead of a majority voter. The comparator compares the outputs 
of all pairs of modules, so that Ey =1 if the outputs of modules i and j do not 
match. Based on these signals, the detector determines which modules are faulty 
and generates the logical outputs Fi,F 2 , . . . , F\;, where F, = 1 if module i has been 
determined to be faulty and 0 otherwise. Finally, the collector unit produces the 
system output, which is the OR of the outputs of all fault-free modules. This way, a 
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FIGURE 2.12 Sift-out structure. 




FIGURE 2.13 Duplex system. 

module whose output disagrees with the outputs of the other modules is switched 
out and no longer contributes to the system output. The implementation of this 
scheme is simpler than that of hybrid redundancy. 

Care must be taken, however, not to be too aggressive in the purging (sifting- 
out) process. The vast majority of failures tend to be transient and disappear on 
their own after some time. It is preferable, therefore, to only purge a module if it 
produces incorrect outputs over a sustained interval of time. 

2.3.6 Duplex Systems 

A duplex system is the simplest example of module redundancy. Figure 2.13 
shows an example of a duplex system consisting of two processors and a com- 
parator. Both processors execute the same task, and if the comparator finds that 
their outputs are in agreement, the result is assumed to be correct. The implicit 
assumption here is that it is highly unlikely for both processors to suffer identical 
hardware failures that result in their both producing identical wrong results. If, on 
the other hand, the results are different, there is a fault, and higher-level software 
has to decide how it is to be handled. 

The fact that the two processors disagree does not, by itself, allow us to identify 
the failed processor. This can be done using one of several methods, some of which 
we will consider below. To derive the reliability of the duplex system we denote, 
as before, by c the coverage factor, which is the probability that a faulty processor 
will be correctly diagnosed, identified, and disconnected. 
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Assuming that the two processors are identical, each with a reliability R(f), the 
reliability of the duplex system is 

^duplex (f) Rcomp(0(R 2 (0 + 2cR(f)[l - m]) (2.31) 

where R c 0 mp is the reliability of the comparator. Assuming a fixed failure rate of X 
for each processor and an ideal comparator (R CO mp(f) = 1)/ the MTTF of the duplex 
system is 

MTTF dup i ex = — + — 

The main difference between a duplex and a TMR system is that in a duplex, the 
faulty processor must be identified. We discuss next the various ways in which 
this can be done. 

Acceptance Tests 

The first method for identifying the faulty processor is to carry out a check of 
each processor's output and is known as an acceptance test. One example of an 
acceptance test is a range test, which checks if the output is in the expected range. 
This is a basic and simple test, which usually works very well. For example, if the 
output of a processor is supposed to indicate the predicted pressure in a container 
(for gases or liquids), we would know the range of pressures that the container 
can hold. Any output outside those values results in the output being flagged as 
faulty. We are therefore using semantic information of the task to predict which 
values of output indicate an error. 

The question is now how to determine the range of acceptable values. The nar- 
rower this range, the greater the probability that an incorrect output will be iden- 
tified as such but so is the probability that a correct output will be misidentified as 
erroneous. We define the sensitivity of a test as the conditional probability that the 
test detects an error given that the output is actually erroneous, and the specificity 
of a test as the conditional probability that the output is erroneous, given that the 
acceptance test declares an error. A narrow range acceptance test will have high 
sensitivity but low specificity, which means that the test is very likely not to miss an 
erroneous output but at the same time it is likely to get many false-positive results 
(correct results that the test declares faulty). 

The reverse happens when we make this range very wide: then we have low 
sensitivity but high specificity. We will consider this problem again when we dis- 
cuss recovery block approaches to software fault tolerance in Chapter 5. 

Range tests are the simplest, but by no means the only, acceptance test mecha- 
nism. Any other test that can discriminate reasonably accurately between a correct 
and an incorrect output can be used. For instance, suppose we want to check the 
correctness of a square-root operation; since (Vx ) 2 = x, we can square the output 
and check if it is the same as the input (or sufficiently close, depending on the level 
of precision used). 
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Hardware Testing 

The second method of identifying the failed processor is to subject both processors 
to some hardware/ logic test routines. Such diagnostic tests are regularly used to 
verify that the processor circuitry is functioning properly, but running them can 
identify the processor that produced the erroneous output only if a permanent 
fault is present in that processor. Since most hardware faults are transient, hard- 
ware testing has a low probability of identifying the processor that failed to pro- 
duce the correct output. 

Even if the hardware fault is permanent, running hardware tests does not guar- 
antee that the fault will be detected. In practice, hardware tests are never perfect, 
and there is a non-zero probability that the test passes as good a processor which is 
actually faulty. The test sensitivity, or the probability of the test identifying a faulty 
processor as such, is in the case of hardware tests often called the test coverage. 

Forward Recovery 

A third method for identifying the faulty processor in a duplex is to use a third 
processor to repeat the computations carried out by the duplex. If only one of the 
three processors (the duplex plus this new processor) is faulty, then whichever 
processor the third disagrees with is the faulty one. 

It is also possible to use a combination of these methods. The acceptance test is 
the quickest to run but is often the least sensitive. The result of the acceptance test 
can be used as a provisional indication of which processor is faulty, and this can 
be confirmed by using either of the other two approaches. 

Pair-and-Spare System 

Several more complicated resilient structures have been proposed that use the du- 
plex as their building block. The first such system that we describe is the pair- 
and-spare system (see Figure 2.14), in which modules are grouped in pairs, and 
each pair has a comparator that checks if the two outputs are equal (or sufficiently 
close). If the outputs of the two primary modules do not match, this indicates an 
error in at least one of them but does not indicate which one is in error. Running 
diagnostic tests, as described in the previous section, will result in a disruption in 
service. To avoid such a disruption, the entire pair is disconnected and the com- 
putation is transferred to a spare pair. The two members of the switched-out pair 
can now be tested offline to determine whether the error was due to a transient or 
permanent fault. In the case of a transient fault, the pair can eventually be marked 
as a good spare. 

Triplex-Duplex System 

Another duplex-based structure is the triplex-duplex system. Here, processors 
are tied together to form duplexes, and then, a triplex is formed out of these du- 
plexes. When the processors in a duplex disagree, both of them are switched out of 
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FIGURE 2.14 A pair-and-spare structure consisting of two duplexes. 

the system. The triplex-duplex arrangement allows for the error masking of vot- 
ing combined with a simpler identification of faulty processors. Furthermore, the 
triplex can continue to function even if only one duplex is left functional, because 
the duplex arrangement allows the detection of faults. Deriving the reliability of a 
triplex-duplex system is reasonably simple and is left for the reader as an exercise. 



2.4 Other Reliability Evaluation Techniques 

Most of the structures that we have described so far have been simple enough to 
allow reliability derivations using straightforward, and relatively simple, combi- 
natorial arguments. Analysis of more complex resilient structures requires more 
advanced reliability evaluation techniques, some of which are described next. 



2.4.1 Poisson Processes 

Consider non-deterministic events of some sort, occurring over time with the fol- 
lowing probabilistic behavior: 

For a time interval of very short length At, 

1 . The probability of one event occurring during the interval At is, for some con- 
stant X, X At plus terms of order At 2 . 

2 . The probability of more than one event occurring during At is negligible (of 
the order of At 2 ). 

3 . The probability of no events occurring during the interval At is 1 — XAt plus 
terms of order At 2 . 

Let N(t) denote the number of events occurring in an interval of length t, and let 

Pj.(f) = Prob{N(f) = k) be the probability of exactly k events occurring during an 
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interval of length t (k — 0, 1,2, . . .). Based on (l)-(3), we have 

P k (t + At) P k -i(t)X At + P k (t)( 1 - 1 At) (for k — 1,2,...) 



and 



P 0 (t+ At) ^P 0 (t)(l- X At) 



These approximations become more accurate as At -> 0, and lead to the differen- 
tial equations: 

^jP-=XP k - 1 (t)-XP k (t) (for k= 1,2,. . .) 

and 



dPp(f) 
d t 



= —kPo(t) 



Using the initial condition P o(0) = 1, the solution to this set of differential equa- 
tions is 



P k (t) = Prob{7V(f) = k\ 




(for k — 0,1/2, . . .) 



A process N(t) with this probability distribution is called a Poisson process with 
rate X. A Poisson process with rate X has the following properties: 



1 . The expected number of events occurring in an interval of length t is Xt. 

2 . The length of time between consecutive events is an exponentially distributed 
random variable with parameter X and mean value 1 / X. 

3 . The numbers of events occurring in disjoint intervals of time are independent 
of one another. 

4 . The sum of two independent Poisson processes with rates /. ] and X 2 is itself a 
Poisson process with rate X\ + X 2 . 



As an example for the use of the Poisson process we consider a duplex system, 
consisting of two active identical processors with an unlimited number of spares. 
The two active processors are subject to failures occurring at a constant rate of 
X per processor. The spares, however, are assumed to always be functional (they 
have a negligible failure rate so long as they are not active). 

When a failure occurs in an active processor, it must be detected and a new 
processor inducted into the duplex to replace the one that just failed. As before, 
we define the coverage factor c as the probability of successful detection and in- 
duction. We, however, assume for simplicity that the comparator failure rate is 
negligible and that the induction process of a new processor is instantaneous. 

Let us now calculate the reliability of this duplex system over the time interval 
[0, t]. We first concentrate on the failure process in one of the two processors. When 
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a processor fails (due to a permanent fault), it is diagnosed and replaced instan- 
taneously. Due to the constant failure rate A, the time between two consecutive 
failures of the same processor is exponentially distributed with parameter A. This 
implies that N(t), the number of failures that occur in this one processor during 
the time interval [0, t], is a Poisson process with the rate A. 

Since the duplex has two active processors, the number of failures that occur in 
the duplex is the sum of the numbers of failures of the two processors, and hence, 
it is also a Poisson process (denoted by M(t)) with rate 2A. The probability that k 
failures occur in the duplex over an interval of duration f is thus 

(2Xt) k 

Prob{/c failures in duplex} = Prob{M(f) = k} = e~ 2u — — — (2.32) 

For the duplex system not to fail, each of these failures must be detected and the 
processor successfully replaced. The probability of one such success is the cover- 
age factor c, and the probability that the system will survive k failures is c k . The 
reliability of the duplex over the interval [0, t] is therefore 



P duplex(0 — X Prob l fc failures in duplex} • c k = X e 

*=0 k= 0 



kj c 



-m (2A tf c‘ 
k\ 



= e 



—2 It (Zktc)* 



T 

k=0 



k\ 



= e ~ 2u e 2xtc 



= e -2A(l -c)t 



(2.33) 



In our derivation, we have used the fact that 

v-2 



X ^ A A 

= i + .v+-+...=y:- 



OO l- 

x 



k = 0 



k\ 



We could have obtained the expression in 2.33 more directly using the type of 
reasoning we employed in the analysis of hybrid redundancy. To reiterate, the 
steps are as follows: 

1. Individual processors fail at a rate A, and so processor failures occur in the 
duplex at the rate 2A. 

2 . Each processor failure has a probability c of being successfully dealt with, and 
a probability 1 — c of causing failure to the duplex. 

3 . As a result, failures that crash the duplex occur with rate 2A(1 — c). 

4 . The reliability of the system is thus e~ 2; - ( l ~d f . 




